432 Class 05

Thomas E. Love, Ph.D.

2024-01-30

Today’s Agenda

  • The HELP study (today’s data) and preliminaries
  • Using ols to fit a linear model
    • Obtaining coefficients and basic summaries
    • Validating summary statistics like \(R^2\)
    • ANOVA in ols
    • Plot Effects with summary and Predict
    • Building and using a nomogram
    • Evaluating Calibration
    • Influential points and dfbeta
  • Spending Degrees of Freedom on Non-Linearity
    • The Spearman \(\rho^2\) (rho-squared) plot
  • Building Non-Linear Predictors in ols
    • Polynomial Functions
    • Restricted Cubic Splines
    • Restricting Interaction (Product) Terms

Today’s R Setup

knitr::opts_chunk$set(comment = NA)

library(mosaic)            ## auto-loads mosaicData
library(janitor)
library(knitr)
library(broom)
library(patchwork)
library(GGally)            ## for scatterplot matrix
library(rms)               ## auto-loads Hmisc
library(tidyverse)

theme_set(theme_bw()) 

Today’s Data, from the HELP study

Today’s Data (helpdat, from the HELP study)

Today’s data set comes from the Health Evaluation and Linkage to Primary Care trial, and is stored as HELPrct in the mosaicData package.

HELP was a clinical trial of adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care.

We will look at 453 subjects with complete data today.

helpdat <- tibble(mosaicData::HELPrct) |>
    select(id, cesd, age, sex, subst = substance,
           mcs, pcs, pss_fr)
    
dim(helpdat)
[1] 453   8

Key Variables for Today

Variable Description
id subject identifier
cesd Center for Epidemiologic Studies Depression measure (higher scores indicate more depressive symptoms)
age subject age (in years)
sex female (n = 107) or male (n = 346)
subst primary substance of abuse (alcohol, cocaine or heroin)
mcs SF-36 Mental Component Score (lower = worse status)
pcs SF-36 Physical Component Score (lower = worse status)
pss_fr perceived social support by friends (higher = more support)
  • All measures from baseline during the subjects’ detoxification stay.
  • More data and details at https://nhorton.people.amherst.edu/help/.

helpdat categorical variables

helpdat |> tabyl(sex, subst) |> 
    adorn_totals(where = c("row", "col")) |>
    adorn_percentages() |>
    adorn_pct_formatting() |>
    adorn_ns(position = "front") |>
    adorn_title(placement = "combined") |>
    kable(align = 'lrrrr')
sex/subst alcohol cocaine heroin Total
female 36 (33.6%) 41 (38.3%) 30 (28.0%) 107 (100.0%)
male 141 (40.8%) 111 (32.1%) 94 (27.2%) 346 (100.0%)
Total 177 (39.1%) 152 (33.6%) 124 (27.4%) 453 (100.0%)

helpdat quantitative variables

helpdat |> select(cesd, age, mcs, pcs, pss_fr) |>
  inspect(digits = 2) 

quantitative variables:  
    name   class  min Q1 median Q3 max mean   sd   n missing
1   cesd integer  1.0 25     34 41  60 32.8 12.5 453       0
2    age integer 19.0 30     35 40  60 35.7  7.7 453       0
3    mcs numeric  6.8 22     29 41  62 31.7 12.8 453       0
4    pcs numeric 14.1 40     49 57  75 48.0 10.8 453       0
5 pss_fr integer  0.0  3      7 10  14  6.7  4.0 453       0

Our Outcome (CES-Depression score)

Describing our outcome CES-D

describe(helpdat$cesd)
helpdat$cesd : CESD at baseline 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     453        0       58    0.999    32.85    14.23     10.0     15.2 
     .25      .50      .75      .90      .95 
    25.0     34.0     41.0     49.0     52.4 

lowest :  1  3  4  5  6, highest: 55 56 57 58 60
  • Info measures the variable’s information between 0 and 1: the higher the Info, the more continuous the variable is (the fewer ties there are.)
  • Gmd = Gini’s mean difference, a robust measure of variation. If you randomly selected two of the 453 subjects many times, the mean difference in cesd would be 14.23 points.

We have some labels in our data

str(helpdat)
tibble [453 × 8] (S3: tbl_df/tbl/data.frame)
 $ id    : int [1:453] 1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "label")= chr "subject ID"
 $ cesd  : int [1:453] 49 30 39 15 39 6 52 32 50 46 ...
  ..- attr(*, "label")= chr "CESD at baseline"
 $ age   : int [1:453] 37 37 26 39 32 47 49 28 50 39 ...
  ..- attr(*, "label")= chr "age (years)"
 $ sex   : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
  ..- attr(*, "label")= chr "sex"
 $ subst : Factor w/ 3 levels "alcohol","cocaine",..: 2 1 3 3 2 2 2 1 1 3 ...
  ..- attr(*, "label")= chr "primary substance of abuse"
 $ mcs   : num [1:453] 25.11 26.67 6.76 43.97 21.68 ...
  ..- attr(*, "label")= chr "SF-36 Mental Component Score"
 $ pcs   : num [1:453] 58.4 36 74.8 61.9 37.3 ...
  ..- attr(*, "label")= chr "SF-36 Physical Component Score"
 $ pss_fr: int [1:453] 0 1 13 11 10 5 1 4 5 0 ...
  ..- attr(*, "label")= chr "perceived social support by friends"

Scatterplot Matrix (code)

temp <- helpdat |>
    select(age, mcs, pcs, pss_fr, sex, subst, cesd)

ggpairs(temp)  ## ggpairs from the GGally package

Note that we’re placing the outcome (cesd) last (result on next slide.)

Saving the Data Set

write_rds(helpdat, "c05/data/helpdat.Rds")

Scatterplot Matrix (result)

Using ols to fit a linear regression model

Fitting using ols

The ols function stands for ordinary least squares and comes from the rms package, by Frank Harrell and colleagues. Any model fit with lm can also be fit with ols.

  • To predict var_y using var_x from the my_tibble data, we would use the following syntax:
dd <- datadist(my_tibble)
options(datadist = "dd")

model_name <- ols(var_y ~ var_x, data = my_tibble,
                  x = TRUE, y = TRUE)

This leaves the following questions:

  1. What’s the datadist stuff doing?
  2. Why use x = TRUE, y = TRUE in the fit?

What is datadist?

Before we fit any ols model to data from my_tibble, we’ll use:

dd <- datadist(my_tibble)
options(datadist = "dd")

Run (the datadist code above) once before any models are fitted, storing the distribution summaries for all potential variables. Adjustment values are 0 for binary variables, the most frequent category (or optionally the first category level) for categorical (factor) variables, the middle level for ordered factor variables, and medians for continuous variables.

  • excerpted from the datadist documentation

Why use x = TRUE, y = TRUE in the fit?

Once we’ve set up the distribution summaries with the datadist code, we fit linear regression models using the same fitting routines as lm with ols:

model_name <- ols(var_y ~ var_x, data = my_tibble,
                  x = TRUE, y = TRUE)
  • ols stores additional information beyond what lm does
  • x = TRUE and y = TRUE save even more expanded information that we’ll need in building plots and summaries of the fit.
  • The defaults are x = FALSE, y = FALSE, but in 432, we’ll always want these to be saved.

Using ols to fit a Two-Predictor Model

Now, we’ll fit an ols model predicting our outcome (cesd) using two predictors (mcs and subst) using the helpdat tibble.

  • Start with setting up the datadist
  • Then fit the model, including x = TRUE, y = TRUE
dd <- datadist(helpdat)
options(datadist = "dd")

mod1 <- ols(cesd ~ mcs + subst, data = helpdat,
                 x = TRUE, y = TRUE)

Contents of mod1?

  • Likelihood Ratio Test?
  • What is the discrimination index g?

New elements in ols

For our mod1,

  • Model Likelihood Ratio test output includes LR chi2 = 295.10, d.f. = 3, Pr(> chi2) = 0.0000

The log of the likelihood ratio, multiplied by -2, yields a test against a \(\chi^2\) distribution. Interpret this as a goodness-of-fit test that compares mod1 to a null model with only an intercept term. In ols this is similar to a global (ANOVA) F test.

  • Under the \(R^2\) values, we have g = 9.827.
  • This is the \(g\)-index, based on Gini’s mean difference. If you randomly selected two of the subjects in the model, the average difference in predicted cesd will be 9.827.
  • This can be compared to the Gini’s mean difference for the original cesd values, from describe, which was Gmd = 14.23.

Validate the summary statistics of an ols fit

  • Can we validate summary statistics by resampling?
  • The data used to fit the model provide an over-optimistic view of the quality of fit.
  • We’re interested here in assessing how well the model might work in new data, and to do so, we can use a resampling approach.
  • Consider \(R^2\) here…

Interpreting the Resampling Validation Results

index.orig training test optimism index.corrected n
\(R^2\) 0.4787 0.4874 0.4737 0.0137 0.4650 40
  • index.orig for \(R^2\) is 0.4787. That’s what we get from the data we used to fit the model, and is what we see in our standard output.
  • With validate we create 40 (by default) bootstrapped resamples of the data and then split each of those into training and test samples.
    • For each of the 40 splits, R refits the model (same predictors) in the training sample to obtain \(R^2\): mean across 40 splits is 0.4874
    • Check each model in its test sample: average \(R^2\) was 0.4737
  • optimism = training result - test result = 0.0137
  • index.corrected = index.orig - optimism = 0.4650

While our nominal \(R^2\) is 0.4787 for mod1, correcting for optimism yields a validated \(R^2\) of 0.4650

  • \(R^2 = 0.4650\) better estimates how mod1 will perform in new data.

ANOVA for mod1 fit by ols

  • This adds a line for the complete regression model (both terms) which can be helpful, but is otherwise the same as anova after lm.
  • As with lm, this is a sequential ANOVA table, so if we had included subst in the model first, we’d get a different SS, MS, F and p for mcs and subst, but the same REGRESSION and ERROR results.

summary for mod1 fit by ols

  • How do we interpret the subst effects estimated by this model?
    • Effect of subst being cocaine instead of alcohol on ces_d is -3.44 assuming no change in mcs, with 90% CI (-5.10, -1.79).
    • Effect of subst being heroin instead of alcohol on ces_d is -1.78 assuming no change in mcs, with 90% CI (-3.54, -0.02).

But what about the mcs effect?

summary for mod1 fit by ols

  • Effect of mcs: -12.66 is the estimated change in cesd associated with a move from mcs = 21.68 (see Low value) to mcs = 40.94 (the High value) assuming no change in subst.
  • ols chooses the Low and High values from the interquartile range.
quantile(helpdat$mcs, c(0.25, 0.75))
     25%      75% 
21.67575 40.94134 

Plot the summary to see effect sizes

  • Goal: plot effect sizes for similar moves within predictor distributions.
plot(summary(mod1))
  • The triangles indicate the point estimate, augmented with confidence interval bars.
    • The 90% confidence intervals are plotted with the thickest bars.
    • The 95% CIs are then shown with thinner, more transparent bars.
    • Finally, the 99% CIs are shown as the longest, thinnest bars.

What do the individual effects look like?

ggplot(Predict(mod1, conf.int = 0.95), layout = c(1,2))
  • The left plot shows the impact of changing mcs on cesd holding subst at its baseline level (alcohol).
  • The right plot shows the impact of changing subst on cesd holding mcs at its median value which is 28.602417.
  • Defaults: add 95% CI bands and layout tries for a square.

Build a nomogram for the ols fit

plot(nomogram(mod1))

Nomograms

For complex models (this model isn’t actually very complex) it can be helpful to have a tool that will help you see the modeled effects in terms of their impact on the predicted outcome.

A nomogram is an established graphical tool for doing this.

  • Find the value of each predictor on its provided line, and identify the “points” for that predictor by drawing a vertical line up to the “Points”.
  • Then sum up the points over all predictors to obtain “Total Points”.
  • Draw a vertical line down from the “Total Points” to the “Linear Predictor” to get the predicted cesd for this subject.

Using the nomogram for the mod1 fit

Predicted cesd for a subject with mcs = 35 and subst = heroin?

Actual Prediction for such a subject…

  • The predict function for our ols fit provides fitted values.
predict(mod1, 
           newdata = tibble(mcs = 35, subst = "heroin"))
       1 
30.52766 
  • The broom package can also support rms fits
augment(mod1, 
           newdata = tibble(mcs = 35, subst = "heroin"))
# A tibble: 1 × 3
    mcs subst  .fitted
  <dbl> <chr>    <dbl>
1    35 heroin    30.5

Assessing the Calibration of mod1

We would like our model to be well-calibrated, in the following sense…

  • Suppose our model assigns a predicted outcome of 6 to several subjects. If the model is well-calibrated, then we expect the mean of those subjects’ actual outcomes to be very close to 6.

We’d like to look at the relationship between the observed cesd outcome and our predicted cesd from the model.

  • The calibration plot we’ll create provides two estimates (with and without bias-correction) of the predicted vs. observed values of our outcome, and compares these to the ideal scenario (predicted = observed).
  • The plot uses resampling validation to produce bias-corrected estimates and uses lowess smooths to connect across predicted values.
  • Calibration plots require x = TRUE, y = TRUE in the ols fit.

Calibration Plot for mod1

set.seed(43299); plot(calibrate(mod1))

n=453   Mean absolute error=0.128   Mean squared error=0.02428
0.9 Quantile of absolute error=0.267

Influential Points for mod1?

The dfbeta value for a particular subject and coefficient \(\beta\) is the change in the coefficient that happens when the subject is excluded from the model.

which.influence(mod1, cutoff = 0.2)
$Intercept
[1] "8"   "351" "405" "433"

$mcs
[1] "351" "402" "450"

$subst
[1] "351"
  • These are the subjects that have absolute values of dfbetas that exceed the specified cutoff (default is 0.2 but it’s an arbitrary choice.)

Show the influential points more directly?

w <- which.influence(mod1, cutoff = 0.2)
d <- helpdat |> select(mcs, subst, cesd) |> data.frame()
show.influence(w, d)
    Count       mcs    subst
8       1   9.16053  alcohol
351     3 *57.48944 *heroin 
402     1 *55.47938  alcohol
405     1  15.07887  alcohol
433     1  18.59431  alcohol
450     1 *62.17550  alcohol
  • Count = number of coefficients where this row appears influential.
  • Use helpdat |> slice(351) to see row 351 in its entirety.
  • Use residual plots (with an lm fit) to check Cook’s distances.

Residuals vs. Fitted Values is easy from ols

plot(resid(mod1) ~ fitted(mod1))

Fitting all Residual Plots for mod1

To fit more complete residual plots (and other things) we can fit the lm version of this same model…

mod1_lm <- lm(cesd ~ mcs + subst, data = helpdat)

par(mfrow = c(2,2)); plot(mod1_lm); par(mfrow = c(1,1))
  • Plots are shown on the next slide. While the subject in row 351 is more influential than most other points, it doesn’t reach the standard of a problematic Cook’s distance.

Residual Plots for mod1

Thinking about Non-Linear Terms?

Non-Linear Terms

In building a linear regression model, we’re most often going to be thinking about:

  • for quantitative predictors, some curvature…
    • perhaps polynomial terms
    • but more often restricted cubic splines
  • for any predictors, possible interactions
    • between categorical predictors
    • between categorical and quantitative predictors
    • between quantitative predictors

Polynomial Regression

A polynomial in the variable x of degree D is a linear combination of the powers of x up to D. Fitting such a model creates a polynomial regression.

  • Linear: \(y = \beta_0 + \beta_1 x\)
  • Quadratic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2\)
  • Cubic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)
  • Quartic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4\)
  • Quintic: \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5\)

An orthogonal polynomial sets up a model design matrix and then scales those columns so that each column is uncorrelated with the previous ones.

  • This reduction in collinearity (correlation between predictors) lets us gauge whether the addition of any particular polynomial term improves model fit.

A new predictor: use pcs to predict cesd?

  • Let’s look at both a linear fit and a loess smooth to see if they indicate meaningfully different things about the association between pcs and cesd
ggplot(helpdat, aes(x = pcs, y = cesd)) + 
    geom_point(size = 2) +
    geom_smooth(method = "loess", formula = y ~ x, 
                se = FALSE, col = "blue") +
    geom_smooth(method = "lm", formula = y ~ x,
                se = FALSE, col = "red") + 
    labs(title = "Linear and Loess Fits for cesd vs. pcs")

Linear and Loess Fits for cesd with pcs

Fitting polynomial regressions with ols

dd <- datadist(helpdat)
options(datadist = "dd")

mod_B1 <- ols(cesd ~ pcs, 
              data = helpdat, x = TRUE, y = TRUE)
mod_B2 <- ols(cesd ~ pol(pcs, 2), 
              data = helpdat, x = TRUE, y = TRUE)
mod_B3 <- ols(cesd ~ pol(pcs, 3),
              data = helpdat, x = TRUE, y = TRUE)
  • Note the use of pol() from the rms package here to fit orthogonal polynomials, rather than poly() which we used for an lm fit.

Model B1 (linear in pcs)

Model B2 (quadratic polynomial in pcs)

Model B3 (cubic polynomial in pcs)

Store the polynomial fits

First, we need to store the values. Again broom doesn’t play well with ols fits, so I’ll just add the predictions as columns

cesd_fits <- helpdat |>
    mutate(fitB1 = predict(mod_B1),
           fitB2 = predict(mod_B2),
           fitB3 = predict(mod_B3))

Code to plot polynomial fits

ggplot(cesd_fits, aes(x = pcs, y = cesd)) +
    geom_point() +
    geom_line(aes(x = pcs, y = fitB1),
              col = "blue", size = 1.25) +
    geom_line(aes(x = pcs, y = fitB2),
              col = "black", size = 1.25) +
    geom_line(aes(x = pcs, y = fitB3),
              col = "red", size = 1.25) +
    geom_text(x = 18, y = 47, label = "Linear Fit", 
              size = 5, col = "blue") +
    geom_text(x = 18, y = 39, label = "Quadratic Fit", 
              size = 5, col = "black") +
    geom_text(x = 18, y = 26, label = "Cubic Fit", 
              size = 5, col = "red") +
    labs(title = "Linear, Quadratic and Cubic Fits for cesd using pcs") 

The Polynomial Fits, plotted

Code to plot polynomial fits with Predict

p1 <- ggplot(Predict(mod_B1)) + ggtitle("B1: Linear")
p2 <- ggplot(Predict(mod_B2)) + ggtitle("B2: Quadratic")
p3 <- ggplot(Predict(mod_B3)) + ggtitle("B3. Cubic")

p1 + p2 + p3

Visualizing the polynomial fits with Predict

Splines

  • A linear spline is a continuous function formed by connecting points (called knots of the spline) by line segments.
  • A restricted cubic spline is a way to build highly complicated curves into a regression equation in a fairly easily structured way.
  • A restricted cubic spline is a series of polynomial functions joined together at the knots.
    • Such a spline gives us a way to flexibly account for non-linearity without over-fitting the model.
    • Restricted cubic splines can fit many different types of non-linearities.
    • Specifying the number of knots is all you need to do in R to get a reasonable result from a restricted cubic spline.

The most common choices are 3, 4, or 5 knots.

  • 3 Knots, 2 degrees of freedom, allows the curve to “bend” once.
  • 4 Knots, 3 degrees of freedom, lets the curve “bend” twice.
  • 5 Knots, 4 degrees of freedom, lets the curve “bend” three times.

Fitting Restricted Cubic Splines with ols

Let’s consider a restricted cubic spline model for cesd based on pcs with:

  • 3 knots in modC3, 4 knots in modC4, and 5 knots in modC5
dd <- datadist(helpdat)
options(datadist = "dd")

mod_C3 <- ols(cesd ~ rcs(pcs, 3), 
              data = helpdat, x = TRUE, y = TRUE)
mod_C4 <- ols(cesd ~ rcs(pcs, 4), 
              data = helpdat, x = TRUE, y = TRUE)
mod_C5 <- ols(cesd ~ rcs(pcs, 5),
              data = helpdat, x = TRUE, y = TRUE)

Model C3 (3-knot spline in pcs)

Model C4 (4-knot spline in pcs)

Model C5 (5-knot spline in pcs)

Code to plot all six fits

p1 <- ggplot(Predict(mod_B1)) + ggtitle("B1: Linear")
p2 <- ggplot(Predict(mod_B2)) + ggtitle("B2: Quadratic")
p3 <- ggplot(Predict(mod_B3)) + ggtitle("B3. Cubic")
p4 <- ggplot(Predict(mod_C3)) + ggtitle("C3: 3-knot RCS")
p5 <- ggplot(Predict(mod_C4)) + ggtitle("C4. 4-knot RCS")
p6 <- ggplot(Predict(mod_C5)) + ggtitle("C5. 5-knot RCS")

(p1 + p2 + p3) / (p4 + p5 + p6)

Visualizing the fits better?

Which of these models looks better?

  • Compare our six models for the cesd to pcs association
  • I used set.seed(432) then validate(mod_B1) etc.
Model Index-Corrected \(R^2\) Corrected MSE
B1 (linear) 0.0848 143.25
B2 (quadratic) 0.0752 142.49
B3 (cubic) 0.0909 143.73
C3 (3-knot RCS) 0.0732 143.31
C4 (4-knot RCS) 0.0870 144.00
C5 (5-knot RCS) 0.0984 141.44
  • So which model has the best (validated) summaries?
  • We’d need to look at residual plots, too, of course.

Data Spending: Non-Linearity Prior to Fits

Spending degrees of freedom wisely

  • Suppose we have a data set with many possible predictors, and minimal theory or subject matter knowledge to guide us.
  • We might want our final inferences to be as unbiased as possible. To accomplish this, we have to pay a penalty (in terms of degrees of freedom) for any “peeks” we make at the data in advance of fitting a model.
  • So that rules out a lot of decision-making about non-linearity based on looking at the data, if our sample size isn’t much larger than 15 times the number of predictors we’re considering including in our model.
  • In our case, we have n = 453 observations on 6 candidate predictors.
  • In addition, adding non-linearity to our model costs additional degrees of freedom.
  • What can we do?

Spearman’s \(\rho^2\) plot: A smart first step?

Spearman’s \(\rho^2\) is an indicator (not a perfect one) of potential predictive punch, but doesn’t give away the game.

  • Idea: Perhaps we should focus our efforts re: non-linearity on predictors that score better on this measure.
spear_cesd <- spearman2(cesd ~ mcs + subst + pcs + 
                            age + sex + pss_fr, 
                        data = helpdat)

Spearman’s \(\rho^2\) Plot

plot(spear_cesd)

Conclusions from Spearman \(\rho^2\) Plot

  • mcs is the most attractive candidate for a non-linear term, as it packs the most potential predictive punch, so if it does turn out to need non-linear terms, our degrees of freedom will be well spent.
    • This does not mean that mcs actually needs a non-linear term, or will show meaningfully better results if a non-linear term is included. We’d have to fit a model with and without non-linearity in mcs to know that.
    • Non-linearity will often take the form of a product term, a polynomial term, or a restricted cubic spline.
  • pcs, also quantitative, has the next most potential predictive punch
  • these are followed by pss_fr and sex.

Grim Reality

With 453 observations (452 df) we should be thinking about models with modest numbers of regression inputs.

  • Non-linear terms (polynomials, splines) just add to the problem, as they need additional df to be estimated.

In this case, we might choose to include non-linear terms in just two or three variables (and that’s it) and even that would be tough to justify with this modest sample size.

Contents of spear_cesd

spear_cesd

Spearman rho^2    Response variable:cesd

        rho2      F df1 df2      P Adjusted rho2   n
mcs    0.438 350.89   1 451 0.0000         0.436 453
subst  0.034   7.97   2 450 0.0004         0.030 453
pcs    0.089  44.22   1 451 0.0000         0.087 453
age    0.000   0.12   1 451 0.7286        -0.002 453
sex    0.033  15.56   1 451 0.0001         0.031 453
pss_fr 0.033  15.57   1 451 0.0001         0.031 453

Proposed New Model

Fit a model to predict cesd using:

  • a 5-knot spline on mcs
  • a 3-knot spline on pcs
  • a linear term on pss_fr
  • a linear term on age
  • an interaction of sex with the main effect of mcs (restricting our model so that terms that are non-linear in both sex and mcs are excluded), and
  • a main effect of subst

Perhaps more than we can reasonably do with 453 observations, but let’s see how it looks.

Our new model mod2

dd <- datadist(helpdat)
options(datadist = "dd")

mod2 <- ols(cesd ~ rcs(mcs, 5) + rcs(pcs, 3) + sex + 
                mcs %ia% sex + pss_fr + age + subst, 
            data = helpdat, x = TRUE, y = TRUE)
  • %ia% tells R to fit an interaction term with sex and the main effect of mcs.
  • We have to include sex as a main effect for the interaction term (%ia%) to work here.

Our new, more complex model mod2

ANOVA for this model

  • Remember that this ANOVA testing is sequential, other than the TOTALs.
  • We can also plot the ANOVA results, for example…

Plotting ANOVA results for mod2

plot(anova(mod2))

Validation of Summary Statistics

summary results for mod2

Plot of summary results for mod2

plot(summary(mod2))

Nomogram for mod2

plot(nomogram(mod2))

Seeing the impact of the modeling another way

ggplot(Predict(mod2))

Residuals vs. Fitted Values to check assumptions

plot(resid(mod2) ~ fitted(mod2))

Checking the model’s calibration

set.seed(432); plot(calibrate(mod2))

n=453   Mean absolute error=0.386   Mean squared error=0.19775
0.9 Quantile of absolute error=0.704

Limitations of lm for fitting complex linear models

We can certainly assess this big, complex model using lm, too:

  • with in-sample summary statistics like adjusted R2, AIC and BIC,
  • we can assess its assumptions with residual plots, and
  • we can also compare out-of-sample predictive quality through cross-validation,

But to really delve into the details of how well this complex model works, and to help plot what is actually being fit, we’ll probably want to fit the model using ols.

  • In Project A, we expect some results that are most easily obtained using lm and others that are most easily obtained using ols.

Next Time

  • The HERS data
  • Fitting a more complex linear regression model
  • Adding missing data into all of this, and using multiple imputation